
On the Data Processing Theorem in the Semi-Deterministic Setting 

Neri Merhav 



E-mail: merhav@ee.technion.ac.il



Department of Electrical Engineering 

Technion - Israel Institute of Technology 

Technion City, Haifa 32000, ISRAEL 



Abstract 



Data processing lower bounds on the expected distortion are derived in the finite-alphabet 
semi-deterministic setting, where the source produces a deterministic, individual sequence, but 
the channel model is probabilistic, and the decoder is subjected to various kinds of limitations, 
e.g., decoders implementable by finite-state machines, with or without counters, and with or 
without a restriction of common reconstruction with high probability. Some of our bounds are
given in terms of the Lempel-Ziv complexity of the source sequence or the reproduction sequence.
We also demonstrate how some analogous results can be obtained for classes of linear
encoders and linear decoders in the continuous alphabet case.

Index Terms: Data processing theorem, finite-state machine, Lempel-Ziv algorithm, redundancy,
delay, common reconstruction.

1 Introduction 

In a series of articles from the 1970s and the 1980s, Ziv [10], [11], [12], and Ziv and Lempel [3], [13], created a theory of universal source coding for individual sequences using finite-state machines. In particular, the work [10] focuses on universal, fixed-rate, (almost) lossless compression of individual sequences using finite-state encoders and decoders, which was then further developed into the famous Lempel-Ziv algorithm [3], [13]. In [11], the framework of [10] was extended to lossy coding for both noiseless and noisy transmission (Subsections II.A and II.B of [11], respectively). It was later extended in other directions, such as the incorporation of side information in the context of almost lossless compression, where the side information data is also modeled as an individual sequence [12]. In other words, an individual-sequence counterpart of Slepian-Wolf coding [8] was studied in [12] (see also a later extension to the lossy case [7]).

The main trigger for this paper stems from the coding theorem for noisy transmission in [11, 
Subsection II. B]. We begin by revisiting the assertion and the proof of the converse part of this 
theorem (Theorem 3 and eqs. (12) and (13) in [11]), which provides a lower bound on the distortion 
in a semi-deterministic setting, where the source emits a deterministic (individual) sequence, but 
the channel model is probabilistic as usual (in particular, it is a discrete memoryless channel) and 
the encoder and decoder are limited to be finite-state machines with no more than s states and a 
given overall delay, which we shall denote by d. While this theorem is essentially correct, it turns 
out that there are certain imprecise steps in its proof (see Appendix for details) and moreover, in 
relation to our corrections to this proof, the assertion of the theorem itself can be strengthened 
and sharpened. The revisited converse theorem imposes no limitations on the encoder,$^1$ and allows
the decoder to be equipped with a modulo-$\ell$ counter ($\ell$ - positive integer) in addition to its $s$
states of memory, which means that within each period of length $\ell$, the decoder is allowed to be
time-varying, as opposed to the time-invariant model used in [11] and in related papers.$^2$ Also,
our lower bound on the distortion depends not only on the number of states $s$ (as in [11]), but also
on the allowed delay $d$ (as well as on some additional redundancy terms).

Beyond the above described revisit of Theorem 3 of [11], we also derive additional lower bounds 
on the expected distortion in the semi-deterministic setting. One of them is associated with a 
restriction of a common reconstruction (with high probability) at both encoder and decoder, which 
is a setup that has recently received some attention in other contexts, like the Wyner-Ziv problem 
(see e.g., [9]), with motivations in medical imaging, etc. In addition, some of our bounds are 
given in a more explicit form, in terms of the Lempel-Ziv complexity of the source sequence or 
the reproduction sequence. This may be interesting in the sense that the Lempel-Ziv complexity 
usually arises when the finite-state structure is imposed on the encoder, whereas in our case, it is 
imposed on the decoder. Finally, we demonstrate how some analogous results can be obtained for
classes of linear encoders and linear decoders in the continuous alphabet case.

$^1$ The assumption that the encoder is a finite-state machine is not really used in [11] either.

$^2$ One might argue that a finite-state machine with $s$ states and a modulo-$\ell$ counter is just a particular finite-state
machine with a total of $s\cdot\ell$ states. While this argument is true in principle, the idea is that this partition
of the total number of allowed states, between those that are allocated to implement a clock (the counter) and those
that are allocated to memory of past input data (the remaining $s$ states), gives us more detailed and more refined
results.

It should be emphasized that our focus in this paper is primarily on lower bounds and converse 
theorems, and not quite on achievability schemes. Most of our bounds can be asymptotically 
approached by conceptually simple, separation-based schemes, in the spirit of the one proposed in 
[11] or with certain modifications and variations on the same ideas. 

The outline of this paper is as follows. In Section 2, we establish notation conventions and formalize
the semi-deterministic setting under consideration. In Section 3, we derive a lower bound
on the distortion without the common reconstruction requirement, and in Section 4, we derive the
parallel lower bound under common reconstruction. In both sections, we also derive the aforementioned
alternative lower bounds, which can be calculated more easily. Finally, in Section 5, we
give an outline of an analogue of the main result of Section 3 for continuous alphabets and linear
encoders and decoders.

2 Problem Formulation and Notation Conventions 

Throughout the paper, random variables will be denoted by capital letters, specific values they may
take will be denoted by the corresponding lower case letters, and their alphabets will be denoted
by calligraphic letters. Similarly, random vectors, their realizations, and their alphabets will
be denoted, respectively, by capital letters, the corresponding lower case letters, and calligraphic
letters, all superscripted by their dimensions. For example, the random vector $Y^n = (Y_1,\ldots,Y_n)$
($n$ - positive integer) may take a specific vector value $y^n = (y_1,\ldots,y_n)$ in $\mathcal{Y}^n$, the $n$-th order
Cartesian power of $\mathcal{Y}$, which is the alphabet of each component of this vector. For $i < j$ ($i$, $j$ -
positive integers), $x_i^j$ will denote the segment $(x_i,\ldots,x_j)$, where for $i = 1$ the subscript will be
omitted.

Let $u = (u_1,u_2,\ldots)$ be an individual source sequence of symbols in a finite alphabet $\mathcal{U}$ of
cardinality $|\mathcal{U}| = J$. The sequence $u$ is encoded using a general encoder, whose output at time $t$ is
$x_t \in \mathcal{X}$, where $\mathcal{X}$ is another finite alphabet$^3$ of size $|\mathcal{X}| = K$. The sequence $x = (x_1,x_2,\ldots)$ is fed
into a discrete memoryless channel (DMC), characterized by the matrix of single-letter transition
probabilities $\{P(y|x),\ x \in \mathcal{X},\ y \in \mathcal{Y}\}$, where the output alphabet $\mathcal{Y}$ is a finite alphabet of size
$|\mathcal{Y}| = L$. The channel output $y = (y_1,y_2,\ldots)$ is in turn fed into a finite-state decoder, which is
defined by the following recursive equations:

$$v_{t-d} = f(z_t,y_t), \qquad t = d+1,d+2,\ldots \tag{1}$$

$$z_{t+1} = g(z_t,y_t), \qquad t = 1,2,\ldots \tag{2}$$

where $z_t \in \mathcal{Z}$ is the decoder state at time $t$, $\mathcal{Z}$ being a finite set of states of size $s$, $v_{t-d} \in \mathcal{V}$ is
the reconstructed sequence, delayed by $d$ time units ($d$ - positive integer), and $f : \mathcal{Z} \times \mathcal{Y} \to \mathcal{V}$ and
$g : \mathcal{Z} \times \mathcal{Y} \to \mathcal{Z}$ are the output function and the next-state function, respectively. The reconstruction
alphabet $\mathcal{V}$ is of size $M$.

$^3$ In the general formulations of the joint source-channel coding problem, the source and the channel are allowed
to operate at different rates, and then, in the case of block codes, source blocks of a given length may be mapped into
channel blocks of a different length. This degree of freedom, however, is essentially available here too, by redefining
$\mathcal{U}$ and $\mathcal{X}$ to be superalphabets of the appropriate sizes.

A slightly more sophisticated model allows the decoder to be equipped with a modulo-$\ell$ counter,
in addition to its state variable. This means that the functions $f$ and $g$ are allowed to be time-varying
within each period of length $\ell$. In particular, in this case, the decoding equations would
admit the form:

$$\tau = t \bmod \ell, \qquad t = 1,2,\ldots \tag{3}$$

$$v_{t-d} = f_\tau(z_t,y_t), \qquad t = d+1,d+2,\ldots \tag{4}$$

$$z_{t+1} = g_\tau(z_t,y_t), \qquad t = 1,2,\ldots \tag{5}$$
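As an illustration of eqs. (3)-(5), the following minimal Python sketch simulates a finite-state decoder with a modulo-$\ell$ counter. The function name `run_decoder`, the calling convention of `f` and `g` (counter value passed as the first argument), and the toy delay-one decoder in the usage lines are assumptions made for illustration only; they do not appear in the paper.

```python
def run_decoder(y, z1, ell, d, f, g):
    """Simulate v_{t-d} = f_tau(z_t, y_t), z_{t+1} = g_tau(z_t, y_t), tau = t mod ell.

    y   : list of channel outputs y_1, ..., y_n
    z1  : initial decoder state z_1
    f,g : hypothetical callables taking (tau, z_t, y_t)
    Returns the reconstructions v_1, ..., v_{n-d}.
    """
    z = z1
    v = []
    for t, y_t in enumerate(y, start=1):
        tau = t % ell                    # modulo-ell counter, eq. (3)
        if t >= d + 1:
            v.append(f(tau, z, y_t))     # output v_{t-d}, eq. (4)
        z = g(tau, z, y_t)               # next state, eq. (5)
    return v


# Toy example (an assumption): the state stores the previous channel output,
# and the decoder emits it one time unit later (d = 1).
y = [0, 1, 1, 0]
print(run_decoder(y, z1=0, ell=2, d=1,
                  f=lambda tau, z, y_t: z,
                  g=lambda tau, z, y_t: y_t))  # [0, 1, 1]
```

In this toy case the output function ignores the counter; a time-varying decoder is obtained simply by letting `f` or `g` depend on `tau`.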

In some applications, one may be interested in a common reconstruction at both the encoder
and decoder (with high probability). In our context, this means that for a certain positive integer,
which we will choose to be $\ell$, there is a deterministic function $q : \mathcal{U}^\ell \to \mathcal{V}^\ell$ such that

$$\lim_{n\to\infty}\frac{\ell}{n}\sum_{i=0}^{n/\ell-1}\Pr\left\{V_{i\ell+1}^{i\ell+\ell} \ne q\left(u_{i\ell+1}^{i\ell+\ell}\right)\right\} = 0, \tag{6}$$

where here and throughout the sequel, probabilities and expectations are defined with respect to
(w.r.t.) the randomness of the channel. This means that there is a target reconstruction $\bar{v}^n$,
obtained by $n/\ell$ successive applications of $q(\cdot)$, i.e., $\bar{v}_{i\ell+1}^{i\ell+\ell} = q(u_{i\ell+1}^{i\ell+\ell})$, $i = 0,1,2,\ldots,n/\ell-1$, such
that $V^n$ is very close to $\bar{v}^n$ in the sense of eq. (6). For example, in the traditional coding theorem of
joint source-channel coding, this is achieved by separate source and channel coding, where $\bar{v}_{i\ell+1}^{i\ell+\ell}$
are rate-distortion reproduction codewords of $u_{i\ell+1}^{i\ell+\ell}$, respectively.



For a given distortion measure $\rho : \mathcal{U} \times \mathcal{V} \to \mathbb{R}$, we are interested in deriving lower bounds
on the minimum achievable expected distortion, $\frac{1}{n}\sum_{t=1}^n E\{\rho(u_t,V_t)\}$, as functions of the alphabet
sizes, the number of states $s$, the allowed delay $d$, and the period $\ell$, if applicable, with/without
a modulo-$\ell$ counter at the decoder, and with/without the requirement of common reconstruction
with high probability.

Throughout our assertions and derivations, we will make heavy use of the following additional
notation. Assume, without essential loss of generality, that $\ell$ divides $n$ and consider the segmentation
of each $n$-vector into $n/\ell$ non-overlapping blocks of length $\ell$, that is,

$$u^n = (\mathbf{u}_0,\mathbf{u}_1,\ldots,\mathbf{u}_{n/\ell-1}), \qquad \mathbf{u}_i = (u_{i\ell+1},u_{i\ell+2},\ldots,u_{i\ell+\ell}), \qquad i = 0,1,\ldots,n/\ell-1,$$

and similar definitions for $x^n$, $y^n$, and $v^n$, where $v_{n-d+1},v_{n-d+2},\ldots,v_n$ (which are not yet reconstructed
at time $t = n$) are defined as arbitrary symbols in $\mathcal{V}$. Let us define the empirical joint
probability mass function

$$\hat{P}_{U^\ell X^\ell Y^\ell V^\ell Z}(u^\ell,x^\ell,y^\ell,v^\ell,z) = \frac{\ell}{n}\sum_{i=0}^{n/\ell-1}\mathcal{I}\left(\mathbf{u}_i = u^\ell,\ \mathbf{x}_i = x^\ell,\ \mathbf{y}_i = y^\ell,\ \mathbf{v}_i = v^\ell,\ z_{i\ell+1} = z\right), \tag{7}$$

where $\mathcal{I}(\cdot)$ is the indicator function of an event. Correspondingly, unless specified otherwise, $U^\ell$, $X^\ell$,
$Y^\ell$, $V^\ell$ and $Z$ are understood to be random variables jointly distributed according to $\hat{P}_{U^\ell X^\ell Y^\ell V^\ell Z}$,
and all information measures associated with them will be denoted as in the customary notation
conventions of the information theory literature, but with "hats". For example, $\hat{H}(U^\ell)$ is the empirical
entropy associated with $U^\ell$, $\hat{I}(X^\ell;Y^\ell)$ is the empirical mutual information between $X^\ell$ and
$Y^\ell$, and so on. Accordingly, the $\ell$-th order empirical rate-distortion function, associated with $u^n$
and distortion measure $\rho$, is defined as

$$R_{U^\ell}(D) = \min\left\{\frac{1}{\ell}I(U^\ell;\tilde{V}^\ell):\ E\rho(U^\ell,\tilde{V}^\ell) \le \ell D\right\}, \tag{8}$$

where $\tilde{V}^\ell$ is a generic random vector (not to be confused with $V^\ell$, which is defined empirically),
taking on values in $\mathcal{V}^\ell$, the mutual information $I(U^\ell;\tilde{V}^\ell)$ and the expected distortion $E\rho(U^\ell,\tilde{V}^\ell)$
are defined w.r.t. $\hat{P}_{U^\ell} \times P_{\tilde{V}^\ell|U^\ell}$, and the minimization is across all conditional distributions $P_{\tilde{V}^\ell|U^\ell}$.
Here, $\rho(U^\ell,\tilde{V}^\ell)$ is defined additively over the corresponding components of both vectors. Similarly,
$D_{U^\ell}(R)$ is the corresponding distortion-rate function, which is the inverse of $R_{U^\ell}(D)$, and which is
defined as

$$D_{U^\ell}(R) = \min\left\{\frac{1}{\ell}E\rho(U^\ell,\tilde{V}^\ell):\ I(U^\ell;\tilde{V}^\ell) \le \ell R\right\}. \tag{9}$$

In the sequel, we will define some additional empirical rate-distortion functions and distortion-rate
functions, with certain modifications of the above definitions.
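To make the empirical quantities concrete, here is a small Python sketch that computes the $\ell$-block empirical distribution of a single sequence (the $U^\ell$-marginal of eq. (7)) and the associated empirical entropy $\hat{H}(U^\ell)$ in bits. The helper names `block_empirical` and `empirical_entropy` are hypothetical and introduced only for this illustration.

```python
from collections import Counter
from math import log2


def block_empirical(u, ell):
    """Empirical distribution of non-overlapping ell-blocks of the sequence u.

    Assumes, as in the text, that ell divides n = len(u).
    """
    if len(u) % ell != 0:
        raise ValueError("ell must divide the sequence length")
    blocks = [tuple(u[i:i + ell]) for i in range(0, len(u), ell)]
    counts = Counter(blocks)
    num_blocks = len(blocks)             # equals n / ell
    return {b: c / num_blocks for b, c in counts.items()}


def empirical_entropy(p):
    """Entropy (in bits) of a distribution given as {outcome: probability}."""
    return -sum(q * log2(q) for q in p.values() if q > 0)


u = [0, 1, 0, 1, 0, 0, 1, 1]
p_hat = block_empirical(u, ell=2)        # blocks: (0,1), (0,1), (0,0), (1,1)
print(p_hat[(0, 1)])                     # 0.5
print(empirical_entropy(p_hat))          # 1.5
```

The same counting applies to joint blocks of $(u^n, x^n, y^n, v^n)$ and the state; only the tuples being counted change.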

3 Distortion Bounds Without Common Reconstruction 

We begin from the simpler case where there is no requirement of common reconstruction. Our first 
result is the following: 

Theorem 1 Consider the communication setting described in Section 2. Let $u^n$ be an individual
sequence, let $C$ be the capacity of the discrete memoryless channel, and let the overall coding-decoding
delay be $d$. Then, for every decoder with $s$ states and a modulo-$\ell$ counter,

$$\frac{1}{n}\sum_{t=1}^n E\{\rho(u_t,V_t)\} \ge D_{U^\ell}\left(C + \frac{2\log s + d\log M}{\ell} + \delta_1(\ell,n)\right), \tag{10}$$

where

$$\delta_1(\ell,n) = \frac{(JK)^\ell\log L}{\sqrt{n}} + \frac{(JKL)^\ell\log e}{2n} + o\left(\frac{1}{\sqrt{n}}\right). \tag{11}$$

The interesting term, in the argument of the function $D_{U^\ell}(\cdot)$, is the second one, namely, the term
$(2\log s + d\log M)/\ell$, which seemingly plays the role of an effective "extra capacity" contributed
by the state variable, which carries memory of past data from block to block, and by the allowed
delay. This happens because the lower bound holds for every individual sequence $u^n$ and every
encoder and decoder in the allowed class, including ones that happen to be 'tailored' to $u^n$ in a
certain sense (for example, the finite-state machine at the decoder may be designed to periodically
produce a certain pattern that happens to be repetitive in $u^n$). The dependence on $\ell$ is much
more complicated, because $\ell$ appears also in the additional term $\delta_1(\ell,n)$, and more importantly, in
the function $D_{U^\ell}(\cdot)$ itself. The lower bound is not necessarily a monotonically decreasing function
of $\ell$, but this should not be surprising since the real optimum performance need not have such a
monotonicity property either. For example, if $u^n$ happens to be periodic (or almost periodic) with
period $\ell$, it seems plausible that it will be reproduced better by a decoder with a modulo-$\ell$ counter
than by one with a modulo-$(\ell+1)$ counter, which obviously cannot keep the synchronization with
$u^n$. In the absence of a modulo-$\ell$ counter at the decoder, Theorem 1 still applies, but then $\ell$
becomes just a parameter of the bound, with no apparent operative significance, and since the real
distortion, $\frac{1}{n}\sum_{t=1}^n E\{\rho(u_t,V_t)\}$, is then independent of $\ell$, one may maximize the lower bound w.r.t.
$\ell$ over a certain set of divisors of $n$, for which $n/\ell$ is still appreciably large, such that the $o(1/\sqrt{n})$
term would remain negligible.
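For a feel of the orders of magnitude involved, the following Python sketch evaluates the explicit part of $\delta_1(\ell,n)$ in eq. (11) and the resulting rate argument of $D_{U^\ell}(\cdot)$ in eq. (10), with logarithms taken to base 2. It deliberately drops the unspecified $o(1/\sqrt{n})$ term, so the numbers are only indicative; the function names `delta1` and `rate_argument` are assumptions made for this example.

```python
from math import e, log2, sqrt


def delta1(J, K, L, ell, n):
    """Explicit terms of delta_1(ell, n) in eq. (11), in bits; o(1/sqrt(n)) is omitted."""
    return (J * K) ** ell * log2(L) / sqrt(n) + (J * K * L) ** ell * log2(e) / (2 * n)


def rate_argument(C, s, d, M, J, K, L, ell, n):
    """Argument of the distortion-rate function in eq. (10), up to the o(.) term."""
    return C + (2 * log2(s) + d * log2(M)) / ell + delta1(J, K, L, ell, n)


# Binary alphabets, s = 4 decoder states, delay d = 2, n = 10^6 source symbols.
for ell in (2, 4, 8):
    print(ell, rate_argument(C=0.5, s=4, d=2, M=2, J=2, K=2, L=2, ell=ell, n=10**6))
```

The term $(2\log s + d\log M)/\ell$ shrinks as $\ell$ grows, while $\delta_1(\ell,n)$ grows exponentially in $\ell$, which is one way to see why only moderate values of $\ell$ relative to $n$ yield a useful bound.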

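Proof of Theorem 1. First, observe that since $\hat{P}_{U^\ell X^\ell Y^\ell V^\ell Z}$ is a legitimate probability distribution,
all the rules of manipulating information measures (the chain rule, conditioning reduces entropy,
etc.) hold as usual. We will make use of the fact that $\mathbf{v}_i^{\ell-d} = (v_{i\ell+1},\ldots,v_{i\ell+\ell-d})$ is a deterministic
function of $\mathbf{y}_i$ and $z_{i\ell+1}$, and therefore $(U^\ell,X^\ell) \to Y^\ell \to V^{\ell-d}$ is a Markov chain under $\hat{P}_{U^\ell X^\ell Y^\ell V^\ell|Z}$,
where $V^{\ell-d}$ is the random vector formed by the first $\ell-d$ components of $V^\ell$ (and similarly, below,
$V_{\ell-d+1}^{\ell}$ will denote the vector formed by the remaining $d$ components). We then have the following
chain of inequalities:

\begin{align}
\hat{I}(U^\ell;V^{\ell-d}|Z) &\le \hat{I}(U^\ell;Y^\ell|Z) \tag{12}\\
&\le \hat{I}(U^\ell,X^\ell;Y^\ell|Z) \tag{13}\\
&= \hat{H}(Y^\ell|Z) - \hat{H}(Y^\ell|U^\ell,X^\ell,Z) \tag{14}\\
&\le \hat{H}(Y^\ell) - \hat{H}(Y^\ell|U^\ell,X^\ell) + \hat{I}(Z;Y^\ell|U^\ell,X^\ell) \tag{15}\\
&\le \hat{H}(Y^\ell) - \hat{H}(Y^\ell|U^\ell,X^\ell) + \log s. \tag{16}
\end{align}

On the other hand,

\begin{align}
\hat{I}(U^\ell;V^{\ell-d}|Z) &= \hat{H}(U^\ell|Z) - \hat{H}(U^\ell|V^{\ell-d},Z) \tag{17}\\
&\ge \hat{H}(U^\ell) - \hat{I}(Z;U^\ell) - \hat{H}(U^\ell|V^{\ell-d}) \tag{18}\\
&\ge \hat{H}(U^\ell) - \log s - \hat{H}(U^\ell|V^\ell) - \hat{I}(V_{\ell-d+1}^{\ell};U^\ell|V^{\ell-d}) \tag{19}\\
&\ge \hat{I}(U^\ell;V^\ell) - \log s - d\log M, \tag{20}
\end{align}

and so

$$\hat{I}(U^\ell;V^\ell) \le \hat{H}(Y^\ell) - \hat{H}(Y^\ell|U^\ell,X^\ell) + 2\log s + d\log M. \tag{21}$$

Taking now the expectation of both sides, we get

\begin{align}
E\hat{I}(U^\ell;V^\ell) &\le E\hat{H}(Y^\ell) - E\hat{H}(Y^\ell|U^\ell,X^\ell) + 2\log s + d\log M \nonumber\\
&\le H(Y^\ell) - E\hat{H}(Y^\ell|U^\ell,X^\ell) + 2\log s + d\log M, \tag{22}
\end{align}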



where in the second line, $H(Y^\ell)$ is the entropy of $Y^\ell$ that is induced by $\hat{P}_{X^\ell}$ and the real channel
$P_{Y^\ell|X^\ell}$. Here we have used the fact that $H(Y^\ell)$ is a concave functional of $P_{Y^\ell|X^\ell}$. As for the
evaluation of $E\hat{H}(Y^\ell|U^\ell,X^\ell)$, we invoke the following result (see [1], [2] and [19, Proposition 5.2]
therein, as well as [6, Appendix A]): Let $\hat{P}_n$ be the first-order empirical distribution associated with
an $n$-sequence drawn from a memoryless $m$-ary source $P$. Then,

$$n\cdot E D(\hat{P}_n\|P) = \frac{(m-1)\log e}{2} + o(1), \tag{23}$$

which is equivalent to

$$E\hat{H} = H - \frac{(m-1)\log e}{2n} + o\left(\frac{1}{n}\right), \tag{24}$$

where $\hat{H}$ is the corresponding empirical entropy and $H$ is the true entropy. We now apply this
result to the 'source' $P(y^\ell|u^\ell,x^\ell) = P(y^\ell|x^\ell)$ for every pair $(u^\ell,x^\ell)$ that appears more than $\epsilon n/\ell$
times as $\ell$-blocks along the (deterministic) sequence pair $(u^n,x^n)$.

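\begin{align}
& E\hat{H}(Y^\ell|U^\ell,X^\ell) \tag{25}\\
&= E\left\{\sum_{u^\ell,x^\ell}\hat{P}_{U^\ell X^\ell}(u^\ell,x^\ell)\hat{H}(Y^\ell|U^\ell=u^\ell,X^\ell=x^\ell)\right\} \tag{26}\\
&\ge \sum_{\{u^\ell,x^\ell:\ \hat{P}_{U^\ell X^\ell}(u^\ell,x^\ell)\ge\epsilon\}}\hat{P}_{U^\ell X^\ell}(u^\ell,x^\ell)\,E\hat{H}(Y^\ell|U^\ell=u^\ell,X^\ell=x^\ell) \tag{27}\\
&\ge \sum_{\{u^\ell,x^\ell:\ \hat{P}_{U^\ell X^\ell}(u^\ell,x^\ell)\ge\epsilon\}}\hat{P}_{U^\ell X^\ell}(u^\ell,x^\ell)\left[H(Y^\ell|X^\ell=x^\ell) - \frac{(L^\ell-1)\log e}{2n\hat{P}_{U^\ell X^\ell}(u^\ell,x^\ell)/\ell} - o\left(\frac{\ell}{n\epsilon}\right)\right] \tag{28}\\
&\ge \sum_{\{u^\ell,x^\ell:\ \hat{P}_{U^\ell X^\ell}(u^\ell,x^\ell)\ge\epsilon\}}\hat{P}_{U^\ell X^\ell}(u^\ell,x^\ell)H(Y^\ell|X^\ell=x^\ell) - \frac{\ell(JKL)^\ell\log e}{2n} - o\left(\frac{\ell}{n\epsilon}\right) \tag{29}\\
&\ge \sum_{u^\ell,x^\ell}\hat{P}_{U^\ell X^\ell}(u^\ell,x^\ell)H(Y^\ell|X^\ell=x^\ell) - \sum_{\{u^\ell,x^\ell:\ \hat{P}_{U^\ell X^\ell}(u^\ell,x^\ell)<\epsilon\}}\hat{P}_{U^\ell X^\ell}(u^\ell,x^\ell)H(Y^\ell|X^\ell=x^\ell) - \frac{\ell(JKL)^\ell\log e}{2n} - o\left(\frac{\ell}{n\epsilon}\right) \tag{30}\\
&\ge H(Y^\ell|X^\ell) - \epsilon(JK)^\ell\cdot\ell\log L - \frac{\ell(JKL)^\ell\log e}{2n} - o\left(\frac{\ell}{n\epsilon}\right) \tag{31}\\
&= H(Y^\ell|X^\ell) - \ell\cdot\delta(\epsilon,\ell,n), \tag{32}
\end{align}

where we have defined

$$\delta(\epsilon,\ell,n) = \epsilon(JK)^\ell\log L + \frac{(JKL)^\ell\log e}{2n} + o\left(\frac{1}{n\epsilon}\right). \tag{33}$$

Taking $\epsilon = 1/\sqrt{n}$, we define:

$$\delta_1(\ell,n) = \delta\left(\frac{1}{\sqrt{n}},\ell,n\right) = \frac{(JK)^\ell\log L}{\sqrt{n}} + \frac{(JKL)^\ell\log e}{2n} + o\left(\frac{1}{\sqrt{n}}\right). \tag{34}$$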

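On substituting the inequality

$$E\hat{H}(Y^\ell|U^\ell,X^\ell) \ge H(Y^\ell|X^\ell) - \ell\delta_1(\ell,n) \tag{35}$$

into eq. (22), we get

\begin{align}
E\hat{I}(U^\ell;V^\ell) &\le I(X^\ell;Y^\ell) + 2\log s + d\log M + \ell\delta_1(\ell,n) \tag{36}\\
&\le \ell C + 2\log s + d\log M + \ell\delta_1(\ell,n). \tag{37}
\end{align}

Now, denoting by $\hat{E}$ the empirical expectation (w.r.t. $\hat{P}_{U^\ell X^\ell Y^\ell V^\ell Z}$), we obviously have

\begin{align}
E\hat{I}(U^\ell;V^\ell) &\ge \ell\cdot E R_{U^\ell}\left(\frac{1}{\ell}\hat{E}\rho(U^\ell,V^\ell)\right) \tag{38}\\
&= \ell\cdot E R_{U^\ell}\left(\frac{1}{n}\sum_{t=1}^n\rho(u_t,V_t)\right) \tag{39}\\
&\ge \ell\cdot R_{U^\ell}\left(\frac{1}{n}\sum_{t=1}^n E\rho(u_t,V_t)\right), \tag{40}
\end{align}

where in the last line, we have used the convexity of the rate-distortion function. Finally, we get

$$R_{U^\ell}\left(\frac{1}{n}\sum_{t=1}^n E\rho(u_t,V_t)\right) \le C + \frac{2\log s + d\log M}{\ell} + \delta_1(\ell,n), \tag{41}$$

or

$$\frac{1}{n}\sum_{t=1}^n E\rho(u_t,V_t) \ge D_{U^\ell}\left(C + \frac{2\log s + d\log M}{\ell} + \delta_1(\ell,n)\right). \tag{42}$$

This completes the proof of Theorem 1. □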
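The expansion (23), on which the redundancy term in the proof rests, can be checked numerically in the simplest case of a binary source ($m = 2$). The sketch below computes $n\cdot E D(\hat{P}_n\|P)$ exactly by summing over the binomial distribution of the number of ones. It works in nats, so that $\log e = 1$ and the predicted limit is $(m-1)/2 = 0.5$; the function name `expected_kl_bernoulli` and the chosen parameters are assumptions of this illustration.

```python
from math import comb, log


def expected_kl_bernoulli(p, n):
    """Exact E[D(P_hat_n || P)] in nats for n i.i.d. Bernoulli(p) samples, 0 < p < 1."""
    total = 0.0
    for k in range(n + 1):
        weight = comb(n, k) * p ** k * (1 - p) ** (n - k)   # Pr{k ones out of n}
        q = k / n                                           # empirical frequency of ones
        div = 0.0
        if q > 0:
            div += q * log(q / p)
        if q < 1:
            div += (1 - q) * log((1 - q) / (1 - p))
        total += weight * div
    return total


for n in (50, 200):
    # n * E D(P_hat_n || P) should approach (m - 1) / 2 = 0.5 nats as n grows
    print(n, n * expected_kl_bernoulli(0.5, n))
```

In the proof, the same expansion is applied with $m = L^\ell$ and with the number of samples equal to the number of occurrences of each pair $(u^\ell,x^\ell)$, which is why pairs that occur fewer than $\epsilon n/\ell$ times are treated separately.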



While the lower bound of Theorem 1 is not quite explicit (primarily because of the complicated
dependence of the function $D_{U^\ell}(\cdot)$ on $\ell$ when $u^n$ is arbitrary), we next propose an alternative lower
bound, which is simpler and more explicit. The price of this simplicity, however, is a possible loss
of tightness. The idea is based on the Shannon lower bound. Suppose that $\mathcal{U} = \mathcal{V}$ is a group and
the distortion measure $\rho(u,v)$ depends only on the difference $u - v$ for a well defined subtraction
operation on the group (e.g., subtraction modulo $J$). Accordingly, we denote $\rho(u,v) = \varrho(v - u)$.
We define the function $\Phi(D)$ to be the maximum entropy of a random variable $W$ over an alphabet
of size $J$, subject to the constraint $E\varrho(W) \le D$. We also define

$$\Psi(x) = \begin{cases} 0 & x \le 0\\ \Phi^{-1}(x) & x > 0 \end{cases} \tag{43}$$

Then, our next result is the following.

Theorem 2 Consider the communication setting described in Section 2. Let $u^n$ be an individual
sequence, let $C$ be the capacity of the discrete memoryless channel, and let the overall coding-decoding
delay be $d$. Then, for every decoder with $s$ states and a modulo-$\ell$ counter,

$$\frac{1}{n}\sum_{t=1}^n E\{\varrho(V_t - u_t)\} \ge \Psi\left(\frac{c(u^n)\log c(u^n)}{n} - C - \frac{2\log s + d\log M}{\ell} - \delta_2(\ell,n)\right), \tag{44}$$

where $c(u^n)$ is the number of phrases associated with incremental parsing [13] of $u^n$ and

$$\delta_2(\ell,n) = \delta_1(\ell,n) + \frac{2\ell(1+\log J)^2}{(1-\epsilon_n)\log n} + \frac{2\ell J^{2\ell}\log J}{n} + \frac{1}{\ell}, \tag{45}$$

$\epsilon_n$ being a positive sequence tending to zero as $n \to \infty$.

An important feature of this bound is that the dependence on $\ell$ is now fairly explicit, as it appears
only in the expression $\delta_2(\ell,n) + (2\log s + d\log M)/\ell$, and so, the effect of the choice of $\ell$ can be
better understood. Indeed, for decoders that are not equipped with a counter, the maximization
of the bound over $\ell$, which is equivalent to the minimization of $\delta_2(\ell,n) + (2\log s + d\log M)/\ell$,
is easier now. In particular, it is clear that $\ell$ should be $o(\log n)$ for this expression to vanish as
$n \to \infty$. Another interesting point here is that the bound depends on $u^n$ only via its Lempel-Ziv
complexity, $c(u^n)\log c(u^n)/n$. This is not a trivial fact, because the Lempel-Ziv complexity refers
to the compressibility of $u^n$ using finite-state encoders, whereas here, the encoder is not limited to
be a finite-state machine - only the decoder has such a limitation.
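The quantity $c(u^n)$ can be computed by a single pass over the sequence. The following Python sketch implements incremental (LZ78-style) parsing, in which each new phrase is the shortest string not yet seen as a phrase, and then evaluates $c(u^n)\log c(u^n)/n$ in bits. The function names are hypothetical, and counting a possibly incomplete last phrase as a phrase is a convention assumed here.

```python
from math import log2


def lz78_phrase_count(u):
    """Number of phrases c(u^n) produced by incremental parsing of the sequence u."""
    phrases = set()
    current = ()
    count = 0
    for symbol in u:
        current += (symbol,)
        if current not in phrases:       # shortest string not seen before: close the phrase
            phrases.add(current)
            count += 1
            current = ()
    if current:                          # trailing incomplete phrase (assumed to count)
        count += 1
    return count


def lz_complexity(u):
    """Lempel-Ziv complexity c(u^n) log c(u^n) / n, in bits per symbol."""
    c = lz78_phrase_count(u)
    return c * log2(c) / len(u)


u = [int(b) for b in "1011010100010"]
print(lz78_phrase_count(u))              # 7  (phrases: 1, 0, 11, 01, 010, 00, 10)
print(lz_complexity(u))
```

For such a short sequence the value is not meaningful as a compressibility measure; the bound of Theorem 2 is of interest only when $n$ is large enough for $\delta_2(\ell,n)$ to be small.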

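Proof of Theorem 2. Defining $\tilde{V}^\ell - U^\ell$ as the component-wise difference between the two vectors,
we have:

\begin{align}
\ell\cdot R_{U^\ell}(D) &= \hat{H}(U^\ell) - \max\{H(U^\ell|\tilde{V}^\ell):\ E\varrho(\tilde{V}^\ell - U^\ell) \le \ell D\} \tag{46}\\
&= \hat{H}(U^\ell) - \max\{H(\tilde{V}^\ell - U^\ell|\tilde{V}^\ell):\ E\varrho(\tilde{V}^\ell - U^\ell) \le \ell D\} \tag{47}\\
&= \hat{H}(U^\ell) - \max\{H(W^\ell|\tilde{V}^\ell):\ E\varrho(W^\ell) \le \ell D\} \tag{48}\\
&\ge \hat{H}(U^\ell) - \max\{H(W^\ell):\ E\varrho(W^\ell) \le \ell D\} \tag{49}\\
&\ge \hat{H}(U^\ell) - \max\left\{\sum_{i=1}^\ell H(W_i):\ \sum_{i=1}^\ell E\varrho(W_i) \le \ell D\right\} \tag{50}\\
&\ge \hat{H}(U^\ell) - \max\left\{\sum_{i=1}^\ell \Phi(E\varrho(W_i)):\ \sum_{i=1}^\ell E\varrho(W_i) \le \ell D\right\} \tag{51}\\
&\ge \hat{H}(U^\ell) - \max\left\{\ell\cdot\Phi\left(\frac{1}{\ell}\sum_{i=1}^\ell E\varrho(W_i)\right):\ \sum_{i=1}^\ell E\varrho(W_i) \le \ell D\right\} \tag{52}\\
&= \hat{H}(U^\ell) - \ell\cdot\Phi(D), \tag{53}
\end{align}

where in the last two lines, we have used the concavity and the monotonicity of $\Phi(\cdot)$, respectively. Now,
it is shown in [5, eq. (21)] (see also [4]) that

$$\frac{\hat{H}(U^\ell)}{\ell} \ge \frac{c(u^n)\log c(u^n)}{n} - \delta(\ell,n), \tag{54}$$

where

$$\delta(\ell,n) = \frac{2\ell(1+\log J)^2}{(1-\epsilon_n)\log n} + \frac{2\ell J^{2\ell}\log J}{n} + \frac{1}{\ell}, \tag{55}$$

$\epsilon_n$ being a positive sequence tending to zero, and $c(u^n)$ is the number of phrases in $u^n$ resulting
from Lempel-Ziv incremental parsing. Thus,

\begin{align}
E R_{U^\ell}\left(\frac{1}{n}\sum_{t=1}^n\varrho(V_t - u_t)\right) &\ge \frac{c(u^n)\log c(u^n)}{n} - E\Phi\left(\frac{1}{n}\sum_{t=1}^n\varrho(V_t - u_t)\right) - \delta(\ell,n) \tag{56}\\
&\ge \frac{c(u^n)\log c(u^n)}{n} - \Phi\left(\frac{1}{n}\sum_{t=1}^n E\varrho(V_t - u_t)\right) - \delta(\ell,n), \tag{57}
\end{align}

where the second line follows from the concavity of $\Phi(\cdot)$. Combining this with eqs. (37)-(39), and recalling that $\delta_2(\ell,n) = \delta_1(\ell,n) + \delta(\ell,n)$, we end up with

$$\Phi\left(\frac{1}{n}\sum_{t=1}^n E\varrho(V_t - u_t)\right) \ge \frac{c(u^n)\log c(u^n)}{n} - C - \frac{2\log s + d\log M}{\ell} - \delta_2(\ell,n), \tag{58}$$

or

$$\frac{1}{n}\sum_{t=1}^n E\varrho(V_t - u_t) \ge \Psi\left(\frac{c(u^n)\log c(u^n)}{n} - C - \frac{2\log s + d\log M}{\ell} - \delta_2(\ell,n)\right). \tag{59}$$

This completes the proof of Theorem 2. □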

4 Distortion Bounds Under Common Reconstruction 

Consider next the case where, in addition to the above-mentioned limitations on the decoder, an
additional constraint is imposed, which is the constraint of almost deterministic reconstruction at
the level of $\ell$-blocks. This setting is formalized as follows. For a given vanishing sequence $\epsilon_n \in [0,1]$,
we insist that

$$E\widehat{\Pr}\{V^\ell \ne \bar{V}^\ell\} \equiv \frac{\ell}{n}\sum_{i=0}^{n/\ell-1}\Pr\left\{V_{i\ell+1}^{i\ell+\ell} \ne \bar{v}_{i\ell+1}^{i\ell+\ell}\right\} \le \epsilon_n, \tag{60}$$

where $\bar{V}^\ell = q(U^\ell)$ (and $\bar{v}_{i\ell+1}^{i\ell+\ell} = q(u_{i\ell+1}^{i\ell+\ell})$), for some deterministic function $q$, is the target reconstruction.
We will assume, in this section, that $\rho_{\max} = \max_{u,v}\rho(u,v) < \infty$. Our lower bound for
this case is given by the following theorem.

Theorem 3 Consider the communication setting described in Section 2. Let $u^n$ be an individual
sequence, let $C$ be the capacity of the discrete memoryless channel, and let the overall coding-decoding
delay be $d$. Then, for every decoder with $s$ states, a modulo-$\ell$ counter and a common
reconstruction constraint defined as in eq. (60):

$$\frac{1}{n}\sum_{t=1}^n E\{\rho(u_t,V_t)\} \ge D'_{U^\ell}\left(C + \frac{2\log s + d\log M}{\ell} + \delta_1(\ell,n) + \frac{2\Delta(\epsilon_n)}{\ell}\right) - \rho_{\max}\epsilon_n, \tag{61}$$

where $\Delta(\epsilon_n) = h_2(\epsilon_n) + \epsilon_n\ell\log J$, $h_2(\cdot)$ being the binary entropy function, and

$$D'_{U^\ell}(R) = \min_q\left\{\frac{1}{\ell}\hat{E}\rho(U^\ell,q(U^\ell)):\ \hat{H}(q(U^\ell)) \le \ell R\right\}. \tag{62}$$

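Proof of Theorem 3. First, under the assumption of common reconstruction (60), one readily finds,
using Fano's inequality, that

$$E\hat{H}(V^\ell|U^\ell) \le \Delta(\epsilon_n), \tag{63}$$

where the concavity of the function $\Delta(\cdot)$ was used in order to insert the expectation into the
argument of this function and thereby obtain the real probability of error. Thus,

\begin{align}
E\hat{I}(U^\ell;V^\ell) &= E\hat{H}(V^\ell) - E\hat{H}(V^\ell|U^\ell) \tag{64}\\
&\ge E\hat{H}(V^\ell) - \Delta(\epsilon_n). \tag{65}
\end{align}

Now,

\begin{align}
E\hat{H}(V^\ell) &\ge \hat{H}(\bar{V}^\ell) - E\hat{H}(\bar{V}^\ell|V^\ell) \tag{66}\\
&\ge \hat{H}(\bar{V}^\ell) - \Delta(\epsilon_n), \tag{67}
\end{align}

and so,

$$E\hat{I}(U^\ell;V^\ell) \ge \hat{H}(\bar{V}^\ell) - 2\Delta(\epsilon_n). \tag{68}$$

Now, observe that

\begin{align}
\frac{1}{n}\sum_{t=1}^n\rho(u_t,\bar{v}_t) &= \frac{1}{\ell}\hat{E}\rho(U^\ell,q(U^\ell)) \tag{69}\\
&= \frac{1}{\ell}\sum_{\{(u^\ell,v^\ell):\ q(u^\ell)=v^\ell\}}\hat{P}_{U^\ell V^\ell}(u^\ell,v^\ell)\rho(u^\ell,v^\ell) + \frac{1}{\ell}\sum_{\{(u^\ell,v^\ell):\ q(u^\ell)\ne v^\ell\}}\hat{P}_{U^\ell V^\ell}(u^\ell,v^\ell)\rho(u^\ell,q(u^\ell)) \tag{70}\\
&\le \frac{1}{\ell}\sum_{u^\ell,v^\ell}\hat{P}_{U^\ell V^\ell}(u^\ell,v^\ell)\rho(u^\ell,v^\ell) + \rho_{\max}\cdot\sum_{\{(u^\ell,v^\ell):\ q(u^\ell)\ne v^\ell\}}\hat{P}_{U^\ell V^\ell}(u^\ell,v^\ell) \tag{71}\\
&= \frac{1}{n}\sum_{t=1}^n\rho(u_t,v_t) + \rho_{\max}\cdot\widehat{\Pr}\{\bar{V}^\ell \ne V^\ell\}, \tag{72}
\end{align}

and so, taking the expectation of both sides, we get

$$\frac{1}{n}\sum_{t=1}^n\rho(u_t,\bar{v}_t) \le \frac{1}{n}\sum_{t=1}^n E\rho(u_t,V_t) + \rho_{\max}\epsilon_n. \tag{73}$$

Thus, defining

$$R'_{U^\ell}(D) = \min_q\left\{\frac{1}{\ell}\hat{H}(q(U^\ell)):\ \hat{E}\rho(U^\ell,q(U^\ell)) \le \ell D\right\}, \tag{74}$$

we readily have

\begin{align}
E\hat{I}(U^\ell;V^\ell) &\ge \hat{H}(q(U^\ell)) - 2\Delta(\epsilon_n) \tag{75}\\
&\ge \ell R'_{U^\ell}\left(\frac{1}{n}\sum_{t=1}^n\rho(u_t,\bar{v}_t)\right) - 2\Delta(\epsilon_n) \tag{76}\\
&\ge \ell R'_{U^\ell}\left(\frac{1}{n}\sum_{t=1}^n E\rho(u_t,V_t) + \rho_{\max}\epsilon_n\right) - 2\Delta(\epsilon_n). \tag{77}
\end{align}

This means, of course, that

$$\frac{1}{n}\sum_{t=1}^n E\rho(u_t,V_t) \ge D'_{U^\ell}\left(C + \frac{2\log s + d\log M}{\ell} + \delta_1(\ell,n) + \frac{2\Delta(\epsilon_n)}{\ell}\right) - \rho_{\max}\epsilon_n, \tag{78}$$

completing the proof of Theorem 3. □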

Here too, performance can be expressed in terms of Lempel-Ziv complexity, as $\hat{H}(q(U^\ell))/\ell \ge
[c(\bar{v}^n)\log c(\bar{v}^n)]/n - \delta'(\ell,n)$, where $\delta'(\ell,n)$ is defined just like $\delta(\ell,n)$, but with $J$ replaced by $M$.
Thus,

$$E\hat{I}(U^\ell;V^\ell) \ge \ell R_{LZ}\left(\frac{1}{n}\sum_{t=1}^n E\rho(u_t,V_t) + \rho_{\max}\epsilon_n\ \Big|\ u^n\right) - 2\Delta(\epsilon_n) - \ell\delta'(\ell,n), \tag{79}$$

where

$$R_{LZ}(D|u^n) = \min_q\left\{\frac{c(\bar{v}^n)\log c(\bar{v}^n)}{n}:\ \bar{v}_{i\ell+1}^{i\ell+\ell} = q(u_{i\ell+1}^{i\ell+\ell}),\ i = 0,1,\ldots,n/\ell-1,\ \frac{1}{n}\sum_{t=1}^n\rho(u_t,\bar{v}_t) \le D\right\}. \tag{80}$$

Note that in Section 3, we were able to get bounds on the expected distortion thanks to the
convexity of $R_{U^\ell}(\cdot)$ and the concavity of $\Phi(\cdot)$, whereas now, we obtained such a bound by using the
proximity between the actual expected distortion and the distortion between $u^n$ and its intended
reconstruction $\bar{v}^n$.

5 Linear Encoders and Decoders 

So far, we have dealt with finite alphabets only. It is possible to derive analogous results for
continuous alphabets, if the encoder and decoder are limited to be linear. In this section, we
provide a brief outline of how this can be done, by presenting a result parallel to Theorem 1.

Consider the following structure: The encoder is given by

$$x_t = \sum_{i=1}^\infty a_i x_{t-i} + \sum_{j=0}^\infty b_j u_{t-j}, \tag{81}$$

where $\{a_i\}$ and $\{b_j\}$ are real-valued parameters, chosen such that the encoder would satisfy a
certain input constraint. The finite-state decoder we had before$^4$ is replaced by a decoder with the
same structure, except that now $f$ and $g$ are linear functions (i.e., a state-space representation):

$$v_{t-d} = \alpha z_t + \beta y_t \tag{82}$$

$$z_{t+1} = \gamma z_t + \delta y_t. \tag{83}$$

We will assume, for the sake of simplicity, that $u_t$, $x_t$, $y_t$, $v_t$ and $z_t$ are all real-valued variables
(scalars), although our discussion can be generalized to the vector case ($u_t,v_t \in \mathbb{R}^k$, $x_t,y_t \in \mathbb{R}^m$,
$z_t \in \mathbb{R}^p$, $k$, $m$ and $p$ positive integers), in which case, $\{a_i\}$, $\{b_j\}$, $\alpha$, $\beta$, $\gamma$ and $\delta$ become matrices of the
corresponding dimensions. The channel is assumed to be a discrete-time AWGN channel, i.e., $Y_t = x_t + N_t$,
where $N_t$ is a stationary, i.i.d. zero-mean Gaussian process with variance $\sigma^2$.



$^4$ For simplicity, we now refer to the one without the modulo-$\ell$ counter.

Consider first$^5$ the case where $\{u_t\}$ is a zero-mean, stationary Gaussian process, independent
of $\{N_t\}$, and so, its notation is temporarily changed to $\{U_t\}$. Consequently, all other signals in
the system become random processes, and accordingly, their notation here will use capital letters.
Due to the linearity of the systems, $\{(U_t,X_t,Y_t,V_t,Z_t),\ -\infty < t < \infty\}$ are jointly Gaussian
processes. We assume that these processes are jointly stationary. We also assume that the system
is non-degenerated$^6$ in the sense that

$$\epsilon_z^2 = \lim_{n\to\infty}\mathrm{mmse}\{Z_1|U_1^n,X_1^n,Y_1^n\} > 0 \tag{84}$$

and similarly

$$\epsilon_v^2 = \lim_{n\to\infty}\mathrm{mmse}\{V_0|V_{-n}^{-1},U_{-n}^{n}\} > 0, \tag{85}$$

where $\mathrm{mmse}\{A|B\} = E[A - E(A|B)]^2$ designates the minimum mean squared error in estimating a
random variable $A$ from another random variable $B$, and where the limits obviously exist due to the
non-increasing monotonicity of $\mathrm{mmse}\{Z_1|U_1^n,X_1^n,Y_1^n\}$ and $\mathrm{mmse}\{V_0|V_{-n}^{-1},U_{-n}^{n}\}$ as functions of $n$.
The parameters $\epsilon_z^2$ and $\epsilon_v^2$ are constants that depend on the auto-correlation function of the source,
on the noise variance $\sigma^2$, and on the parameters of the encoder and decoder, $\{a_i\}$, $\{b_j\}$,
$\alpha$, $\beta$, $\gamma$ and $\delta$. Obviously, $\epsilon_z^2 \le \sigma_z^2$ and $\epsilon_v^2 \le \sigma_v^2$, where $\sigma_z^2$ and $\sigma_v^2$ are the variances of $Z_t$ and $V_t$,
respectively. We define $U^\ell = (U_1,\ldots,U_\ell)$, $X^\ell = (X_1,\ldots,X_\ell)$, $Y^\ell = (Y_1,\ldots,Y_\ell)$, $V^\ell = (V_1,\ldots,V_\ell)$,
and $Z = Z_1$. We begin similarly as in eqs. (12)-(16), but the last step must be modified slightly:

\begin{align}
I(U^\ell;V^{\ell-d}|Z) &\le I(U^\ell;Y^\ell|Z) \tag{86}\\
&\le I(U^\ell,X^\ell;Y^\ell|Z) \tag{87}\\
&= h(Y^\ell|Z) - h(Y^\ell|U^\ell,X^\ell,Z) \tag{88}\\
&\le h(Y^\ell) - h(Y^\ell|U^\ell,X^\ell) + I(Z;Y^\ell|U^\ell,X^\ell) \tag{89}\\
&= h(Y^\ell) - h(Y^\ell|X^\ell) + \frac{1}{2}\log\frac{\mathrm{mmse}\{Z|U^\ell,X^\ell\}}{\mathrm{mmse}\{Z|U^\ell,X^\ell,Y^\ell\}} \tag{90}\\
&\le I(X^\ell;Y^\ell) + \frac{1}{2}\log\frac{\sigma_z^2}{\epsilon_z^2} \tag{91}\\
&\le \ell C + \frac{1}{2}\log\frac{\sigma_z^2}{\epsilon_z^2}. \tag{92}
\end{align}

$^5$ This assumption will be dropped soon.

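$^6$ For example, if $\gamma = \delta = 0$ and hence $Z_t = 0$, or if $\alpha = \beta = 0$ and hence $V_t = 0$, the system is obviously
degenerated.

On the other hand,

\begin{align}
I(U^\ell;V^{\ell-d}|Z) &= h(U^\ell|Z) - h(U^\ell|V^{\ell-d},Z) \tag{93}\\
&\ge h(U^\ell|Z) - h(U^\ell|V^{\ell-d}) \tag{94}\\
&= h(U^\ell) - I(Z;U^\ell) - h(U^\ell|V^\ell) - I(V_{\ell-d+1}^{\ell};U^\ell|V^{\ell-d}) \tag{95}\\
&= h(U^\ell) - h(U^\ell|V^\ell) - I(Z;U^\ell) - \sum_{i=\ell-d+1}^{\ell}I(V_i;U^\ell|V^{i-1}) \tag{96}\\
&\ge I(U^\ell;V^\ell) - \frac{1}{2}\log\frac{\sigma_z^2}{\epsilon_z^2} - \frac{d}{2}\log\frac{\sigma_v^2}{\epsilon_v^2}, \tag{97}
\end{align}

and so,

$$I(U^\ell;V^\ell) \le \ell C + \log\frac{\sigma_z^2}{\epsilon_z^2} + \frac{d}{2}\log\frac{\sigma_v^2}{\epsilon_v^2}. \tag{98}$$

This is quite analogous to the bounds we obtained in the finite-alphabet case, but now $\log s$ and
$\log M$ are replaced by $\log\frac{\sigma_z}{\epsilon_z}$ and $\log\frac{\sigma_v}{\epsilon_v}$, respectively. Thus, $\frac{\sigma_z}{\epsilon_z}$ and $\frac{\sigma_v}{\epsilon_v}$ play the roles of effective alphabet
sizes (or effective resolution levels) of the variables $Z_t$ and $V_t$, respectively. Now, clearly, in the
Gaussian case, $I(U^\ell;V^\ell)$ depends on the joint density of $(U^\ell,V^\ell)$ only via the covariance matrix of
this random vector. Equivalently, consider the class of Gaussian channels from $U^\ell$ to $V^\ell$, defined
by

$$V^\ell = GU^\ell + W^\ell, \tag{99}$$

where $G$ is a deterministic $\ell \times \ell$ matrix and $W^\ell$ is a zero-mean Gaussian vector, independent of
$U^\ell$, with covariance matrix $\Sigma_W$. Denoting the covariance matrix of $U^\ell$ by $\Sigma_U$, we have

$$I(U^\ell;V^\ell) = \frac{1}{2}\log\frac{\det(G\Sigma_U G^T + \Sigma_W)}{\det(\Sigma_W)} = \frac{1}{2}\log\det(I + \Sigma_W^{-1}G\Sigma_U G^T). \tag{100}$$

Thus, defining

$$R_\ell(D) = \min\left\{\frac{1}{2\ell}\log\det(I + \Sigma_W^{-1}G\Sigma_U G^T):\ E\rho(U^\ell,V^\ell) \le \ell D\right\}, \tag{101}$$

where $\rho$ designates the quadratic distortion measure (or any other distortion measure such
that $E\rho(U^\ell,V^\ell)$ depends only on the covariance matrix of $(U^\ell,V^\ell)$), we have

$$R_\ell(D) \le C + \frac{1}{\ell}\log\frac{\sigma_z^2}{\epsilon_z^2} + \frac{d}{2\ell}\log\frac{\sigma_v^2}{\epsilon_v^2}, \tag{102}$$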

or

$$\frac{1}{\ell}\sum_{t=1}^{\ell}E\rho(U_t,V_t) \ge D_\ell\left(C + \frac{1}{\ell}\log\frac{\sigma_z^2}{\epsilon_z^2} + \frac{d}{2\ell}\log\frac{\sigma_v^2}{\epsilon_v^2}\right). \tag{103}$$

Now, the l.h.s. of (102) depends only on the covariance matrix of the source, whereas $C$ (or
$I(X^\ell;Y^\ell)$) depends only on the covariance matrix $\Sigma_X$ of $X^\ell$ and the covariance matrix $\Sigma_N$ of
the noise vector, which we have taken to be $\sigma^2 I$. Since the encoding and decoding systems are linear,
the auto-correlation and cross-correlation functions of their outputs depend only on those of their
inputs (for a given linear encoder and decoder), no matter whether these processes are Gaussian or
not. The expected distortion also depends on the joint density of $(U^\ell,V^\ell)$ only via the variances and
covariances of their components. Consequently, at this point, the Gaussian assumption becomes
immaterial. The source $U^\ell$ may have any pdf with a given covariance matrix $\Sigma_U$. In particular,
we can take $\Sigma_U$ to be the empirical covariance matrix of a deterministic source sequence $u^n$. In
this case, in the above chains of inequalities, all information measures should be replaced by their
empirical counterparts, which depend on the empirical covariances instead of the true covariances.
The only exception is that, similarly as in the finite-alphabet case, in the chain of eqs. (86)-(92), it is no longer
true that $\hat{h}(Y^\ell|X^\ell,U^\ell) = h(Y^\ell|X^\ell)$, since there might be empirical correlations between the source
vector and the noise vector. However, $E\hat{h}(Y^\ell|X^\ell,U^\ell)$ tends to $h(Y^\ell|X^\ell)$ by the weak law of large
numbers, so as before, upon taking expectations, one can obtain a distortion bound analogous to
the one we obtained in the finite-alphabet case. In particular, for the quadratic distortion measure,
we have:

\begin{align}
\frac{1}{n}\sum_{t=1}^n E\rho(u_t,V_t) &\ge \min_{G,\Sigma_W}\left\{\frac{1}{n}\sum_{i=0}^{n/\ell-1}E\|\mathbf{u}_i - G\mathbf{u}_i - W_i\|^2:\ \frac{1}{2\ell}\log\det(I + \Sigma_W^{-1}G\hat{\Sigma}_U G^T) \le C + \frac{1}{\ell}\log\frac{\sigma_z^2}{\epsilon_z^2} + \frac{d}{2\ell}\log\frac{\sigma_v^2}{\epsilon_v^2} + \epsilon_n\right\} \tag{104}\\
&= \min_{G,\Sigma_W}\left\{\frac{1}{\ell}\mathrm{tr}\{(G-I)\hat{\Sigma}_U(G^T-I) + \Sigma_W\}:\ \frac{1}{2\ell}\log\det(I + \Sigma_W^{-1}G\hat{\Sigma}_U G^T) \le C + \frac{1}{\ell}\log\frac{\sigma_z^2}{\epsilon_z^2} + \frac{d}{2\ell}\log\frac{\sigma_v^2}{\epsilon_v^2} + \epsilon_n\right\}, \tag{105}
\end{align}

where $\hat{\Sigma}_U = \frac{\ell}{n}\sum_{i=0}^{n/\ell-1}\mathbf{u}_i\mathbf{u}_i^T$ is the empirical covariance of the source, $W_i$ is a zero-mean random
vector with covariance matrix $\Sigma_W$, and $\epsilon_n$ is the vanishing difference between $E\hat{h}(Y^\ell|X^\ell,U^\ell)/\ell$
and $h(Y|X)$. The point here is that for the purpose of obtaining a lower bound on the distortion
attainable by linear encoders and decoders, we are replacing the optimization over infinitely many
parameters $\{a_i\}$, $\{b_j\}$, $\alpha$, $\beta$, $\gamma$, and $\delta$, by optimization over two $\ell \times \ell$ matrices, $G$ and $\Sigma_W$, at the
possible rate loss of $\frac{1}{\ell}\log\frac{\sigma_z^2}{\epsilon_z^2} + \frac{d}{2\ell}\log\frac{\sigma_v^2}{\epsilon_v^2} + \epsilon_n$, which vanishes as $\ell$ and $n$ grow. Thus, the parameter
$\ell$ trades off the quality of the bound (its tightness) with the complexity of the optimization.

Note that here our bounds are a bit weaker than in the finite-alphabet case, in the sense that
they depend on the competing linear system with parameters $\{a_i\}$, $\{b_j\}$, $\alpha$, $\beta$, $\gamma$ and $\delta$ (via $\epsilon_z^2$ and
$\epsilon_v^2$). However, the dependence on these parameters becomes weaker and weaker as $\ell$ grows without
bound.
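As a sanity check on the form of the optimization in eq. (105), consider the scalar case $\ell = 1$, where $G = g$ and $\Sigma_W = \sigma_w^2$ are scalars. The Python sketch below minimizes $(g-1)^2\sigma_u^2 + \sigma_w^2$ subject to $\frac{1}{2}\log_2(1 + g^2\sigma_u^2/\sigma_w^2) \le R$ by a grid search over $g \in [0,1]$, and recovers the familiar Gaussian distortion-rate value $\sigma_u^2 2^{-2R}$. The function name, the grid search, the restriction of $g$ to $[0,1]$, and the symbol $R$ for the total rate budget on the right-hand side of the constraint are assumptions made for this illustration.

```python
def scalar_distortion_bound(sigma_u2, rate_bits, grid=10000):
    """Grid-search minimum of (g-1)^2 sigma_u^2 + sigma_w^2 under the scalar rate constraint.

    For a fixed gain g, the rate constraint 0.5*log2(1 + g^2 sigma_u^2 / sigma_w^2) <= R
    is equivalent to sigma_w^2 >= g^2 sigma_u^2 / (2^(2R) - 1), and the distortion is
    increasing in sigma_w^2, so the smallest admissible sigma_w^2 is used.
    Requires rate_bits > 0.
    """
    a = 2.0 ** (2.0 * rate_bits) - 1.0
    best = float("inf")
    for k in range(grid + 1):
        g = k / grid                                  # candidate gain in [0, 1]
        sigma_w2 = g * g * sigma_u2 / a               # smallest noise variance meeting the rate
        distortion = (g - 1.0) ** 2 * sigma_u2 + sigma_w2
        best = min(best, distortion)
    return best


print(scalar_distortion_bound(1.0, 1.0))   # 0.25, i.e., sigma_u^2 * 2^(-2R)
```

For $\ell > 1$ the same problem becomes a joint optimization over the matrices $G$ and $\Sigma_W$, which is where the trade-off between tightness and optimization complexity mentioned above arises.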

Appendix 

Some Concerns About the Proof of Theorem 3 in [11]. 

First, it should be pointed out that in [11, p. 140], the encoder was also assumed to be a finite-state
machine, and so, in this appendix, following the notation of [11], the state of the encoder is
denoted by $z_t$ and the state of the decoder is denoted by $z'_t$.

In [11], the joint probability distribution of all random variables was defined (in our notation)
to be

$$P_{U^\ell X^\ell Y^\ell V^\ell ZZ'}(u^\ell,x^\ell,y^\ell,v^\ell,z,z') = P(z,z')\hat{P}_{U^\ell}(u^\ell)P_{X^\ell|U^\ell,Z}(x^\ell|u^\ell,z)P(y^\ell|x^\ell)P_{V^\ell|Y^\ell,Z'}(v^\ell|y^\ell,z'), \tag{A.1}$$

where $P(z,z')$ is the expectation of the joint empirical distribution of the state of the encoder,
denoted here by $Z$, and the state of the decoder, denoted here by $Z'$, at the beginnings of all $\ell$-blocks,
and $P(y^\ell|x^\ell)$ is the real conditional probability associated with the channel. First, observe
that according to this definition, $U^\ell$ is taken to be independent of $Z$ and $Z'$, which is inconsistent
with the fact that the encoder state $Z$ varies in response to the source and that there might be
empirical dependencies between successive $\ell$-blocks of the source. Also, according to this definition,
$Y^\ell$ is independent of $Z'$ given $X^\ell$, which, similarly to the earlier comment, does not seem to agree
with the fact that $Z'$ responds to the decoder input $Y^\ell$.

Another issue is the use of the data processing theorem when it comes to empirical distributions.
For example, the equality [11, p. 141, top] $I(Z,U^\ell,X^\ell;V^\ell) = I(Z,X^\ell;V^\ell)$ is questionable because
there might be incidental empirical dependencies between $U^\ell$ and $V^\ell$ given $(Z,X^\ell)$.

Finally, we have concerns regarding the way in which the delay was handled in [11], where
the decoder output $v_{t-d}$ was simply renamed $v_t$. It should be kept in mind that while the data
processing theorem applies to $\ell$-blocks of $\{u_t\}$, $\{x_t\}$, $\{y_t\}$ and $\{v_{t-d}\}$, the distortion is measured
between $u_t$ and $v_t$, and so, the discrepancy between $\{v_t\}$ and its delayed version $\{v_{t-d}\}$ is real
and cannot be handled by simple renaming. Indeed, in [11], the lower bound does not depend on
$d$, a fact which is in contrast to the expectation that the larger $d$ is, the better the performance
that can be achieved.